Annotation Graphs: A Foundation for Integrating Tools, Formats and Corpora
Authors
Abstract
In recent work we have presented a formal framework for linguistic annotations using labeled acyclic digraphs. These 'annotation graphs' offer a simple yet powerful method for representing complex annotation structures incorporating hierarchy and overlap. We illustrate some applications to existing discourse-level annotations of text and speech data. Annotation graphs are capable of representing the structure and content of a diverse range of formats, and this opens the door to wide-ranging integration of tools and corpora. We show how the approach facilitates substantive comparison of annotations expressed in different formats and how it permits queries on corpora which have been annotated at multiple levels using different coding standards and tools. Finally, we describe our philosophy on tool development.
1 Annotation Graphs
When we examine the kinds of speech transcription and annotation found in many existing 'communities of practice', we see commonality of abstract form along with diversity of concrete format. Our wide-ranging survey of annotation practice (Bird and Liberman, 1999) attests to this commonality amidst diversity. (See www.ldc.upenn.edu/annotation for pointers to online material.) We observed that all annotations of recorded linguistic signals require one unavoidable basic action: to associate a label, or an ordered sequence of labels, with a stretch of time in the recording(s). Such annotations also typically distinguish labels of different types, such as spoken words vs. non-speech noises. Different types of annotation often span different-sized stretches of recorded time, without necessarily forming a strict hierarchy: thus a conversation contains (perhaps overlapping) conversational turns, turns contain (perhaps interrupted) words, and words contain (perhaps shared) phonetic segments.
Some types of annotation are systematically incommensurable with others: thus disfluency structures (Taylor, 1995) and focus structures (Jackendoff, 1972) often cut across conversational turns and syntactic constituents. A minimal formalization of this basic set of practices is a directed graph with fielded records on the arcs and optional time references on the nodes. We have argued that this minimal formalization in fact has sufficient expressive capacity to encode, in a reasonably intuitive way, all of the kinds of linguistic annotations in use today. We also have argued that this minimal formalization has good properties with respect to creation, maintenance and searching of annotations. We believe that these advantages are especially strong in the case of discourse annotations, because of the prevalence of cross-cutting structures and the need to compare multiple annotations representing different purposes and perspectives. Translation into annotation graphs does not magically create …
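As an illustration only, and not the authors' implementation, the minimal formalization described above (a directed graph with fielded records on the arcs and optional time references on the nodes) might be sketched in Python as follows. The names `AnnotationGraph`, `Arc`, `span` and `overlapping`, the node identifiers, and the sample labels and times are all hypothetical, and the sketch does not check that the graph is acyclic.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Arc:
    """A fielded record on an arc: a typed label spanning two nodes."""
    src: str
    dst: str
    type: str   # hypothetical field, e.g. "word" or "turn"
    label: str


class AnnotationGraph:
    """Hypothetical sketch; acyclicity is assumed, not enforced."""

    def __init__(self) -> None:
        # node id -> time offset in seconds, or None when the node is unanchored
        self.nodes: Dict[str, Optional[float]] = {}
        self.arcs: List[Arc] = []

    def add_node(self, node_id: str, time: Optional[float] = None) -> None:
        self.nodes[node_id] = time

    def add_arc(self, src: str, dst: str, type: str, label: str) -> Arc:
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both end nodes must be added before the arc")
        arc = Arc(src, dst, type, label)
        self.arcs.append(arc)
        return arc

    def span(self, arc: Arc) -> Optional[Tuple[float, float]]:
        """Return (start, end) times, or None if either node has no time."""
        start, end = self.nodes[arc.src], self.nodes[arc.dst]
        if start is None or end is None:
            return None
        return (start, end)

    def overlapping(self, type: str) -> List[Tuple[str, str]]:
        """Label pairs of same-type arcs whose time spans overlap."""
        timed = [(a, self.span(a)) for a in self.arcs if a.type == type]
        timed = [(a, s) for a, s in timed if s is not None]
        pairs = []
        for (a, (a0, a1)), (b, (b0, b1)) in combinations(timed, 2):
            if a0 < b1 and b0 < a1:
                pairs.append((a.label, b.label))
        return pairs


# Two overlapping conversational turns; the first contains two words
# that share an untimed boundary node.
g = AnnotationGraph()
g.add_node("n0", 0.0)
g.add_node("n1")            # word boundary with no time reference
g.add_node("n2", 2.0)
g.add_node("n3", 1.5)
g.add_node("n4", 3.0)
g.add_arc("n0", "n1", "word", "hello")
g.add_arc("n1", "n2", "word", "there")
g.add_arc("n0", "n2", "turn", "speaker A")
g.add_arc("n3", "n4", "turn", "speaker B")

print(g.overlapping("turn"))  # [('speaker A', 'speaker B')]
```

Because turns and words are arcs over a shared set of nodes rather than nested containers, the two overlapping turns coexist without forcing a strict hierarchy, and the untimed word boundary is still ordered by the graph structure.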
Similar resources
Polyglot and Speech Corpus Tools: A System for Representing, Integrating, and Querying Speech Corpora
Speech datasets from many languages, styles, and sources exist in the world, representing significant potential for scientific studies of speech—particularly given structural similarities among all speech datasets. However, studies using multiple speech corpora remain difficult in practice, due to corpus size, complexity, and differing formats. We introduce open-source software for unified corp...
Annotation graphs as a framework for multidimensional linguistic data analysis
In recent work we have presented a formal framework for linguistic annotation based on labeled acyclic digraphs. These ‘annotation graphs’ offer a simple yet powerful method for representing complex annotation structures incorporating hierarchy and overlap. Here, we motivate and illustrate our approach using discourse-level annotations of text and speech data drawn from the CALLHOME, COCONUT, M...
Command-line utilities for managing and exploring annotated corpora
Users of annotated corpora frequently perform basic operations such as inspecting the available annotations, filtering documents, formatting data, and aggregating basic statistics over a corpus. While these may be easily performed over flat text files with stream-processing UNIX tools, similar tools for structured annotation require custom design. Dawborn and Curran (2014) have developed a decl...
The EUDICO Project, Multi Media Annotation over the Internet
In this paper we describe a software environment that facilitates media annotation and analysis of media-related corpora over the internet. We will describe the general architecture of this environment and we will introduce our Abstract Corpus Model with which we isolate corpus-specific formats from the annotation and analysis tools. The main set of tools is described by giving examples of the...
ATLAS: A Flexible and Extensible Architecture for Linguistic Annotation
We describe a formal model for annotating linguistic artifacts, from which we derive an application programming interface (API) to a suite of tools for manipulating these annotations. The abstract logical model provides for a range of storage formats and promotes the reuse of tools that interact through this API. We focus first on “Annotation Graphs,” a graph model for annotations on linear sig...